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Abstract 

Detecting similarities between ligand binding sites in the absence of global homology between target proteins has been 
recognized as one of the critical components of modern drug discovery. Local binding site alignments can be constructed 
using sequence order-independent techniques, however, to achieve a high accuracy, many current algorithms for binding 
site comparison require high-quality experimental protein structures, preferably in the bound conformational state. This, in 
turn, complicates proteome scale applications, where only various quality structure models are available for the majority of 
gene products. To improve the state-of-the-art, we developed eMatchSite, a new method for constructing sequence order- 
independent alignments of ligand binding sites in protein models. Large-scale benchmarking calculations using adenine- 
binding pockets in crystal structures demonstrate that eMatchSite generates accurate alignments for almost three times 
more protein pairs than SOIPPA. More importantly, eMatchSite offers a high tolerance to structural distortions in ligand 
binding regions in protein models. For example, the percentage of correctly aligned pairs of adenine-binding sites in weakly 
homologous protein models is only 4-9% lower than those aligned using crystal structures. This represents a significant 
improvement over other algorithms, e.g. the performance of eMatchSite in recognizing similar binding sites is 6% and 13% 
higher than that of SiteEngine using high- and moderate-quality protein models, respectively. Constructing biologically 
correct alignments using predicted ligand binding sites in protein models opens up the possibility to investigate drug- 
protein interaction networks for complete proteomes with prospective systems-level applications in polypharmacology and 
rational drug repositioning. eMatchSite is freely available to the academic community as a web-server and a stand-alone 
software distribution at http://www.brylinski.org/ematchsite. 
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This is a PLOS Computational Biology Software Article 
Introduction 

The ability of proteins to perform their molecular functions 
often associates with the reversible binding of a variety of small 
molecules, e.g. metabolites, neurotransmitters, hormones, and 
peptides. Ligand binding occurs on specific interaction sites, where 
depressions and pockets are formed at a protein molecular surface 
to facilitate binding through various non-covalent intermolecular 
forces including hydrogen bonds, electrostatic, and van der Waals 
interactions. These direct protein-ligand contacts along with the 
solvation and desolvation effects play a key role in the association 
process determining the strength of binding, or binding affinity 
[1]. Importantly, the specificity of binding sites towards small 
molecules arises from their chemical composition as well as 
geometric features. Many disease conditions can be directly linked 



to the cellular activities of certain molecular targets, modulating of 
which can restore homeostasis. Therefore, altering molecular 
functions of proteins using high-affinity compounds is a key 
strategy in pharmacotherapy. In particular, structure-based drug 
discovery involves the development and further optimization of 
synthetic and semi-synthetic compounds to target specific proteins 
of pharmacological relevance [2,3]. Since modern drug discovery 
is routinely supported by computational approaches, such as 
virtual screening [4,5] and quantitative structure-activity relation- 
ship methods [6,7], the accurate modeling of protein-ligand 
interactions is of a paramount importance for the development of 
new and effective biopharmaceuticals. 

Selectivity of binding remains a salient issue in pharmacology. 
Selective compounds have a tendency to bind to a limited number 
of different molecular targets in the cell, whereas those more 
promiscuous may affect the activity of a larger group of proteins 
often leading to adverse effects. The classical picture of very 
selective drug binding has been challenged by recent experimental 
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and computational studies, which strongly suggest that the space of 
protein-drug interactions is dense and highly connected [8]. 
Several independent studies attempted to estimate the promiscuity 
of protein-drug interactions; for instance, a large-scale across- 
target activity analysis carried out for 189,807 active compounds 
from PubChem revealed that the majority (62%) of them exhibit 
activity against multiple, often unrelated targets [9] . Furthermore, 
a similar study conducted using a set of 3,138 compounds tested 
on up to 79 targets reported that 47% and 24% of the compounds 
can be classified as "promiscuous" and "highly promiscuous", 
respectively, with multiple targets hit at the IC50 of <10 U.M [10]. 
Finally, a thorough survey carried out for a network of 5,215 drug- 
target interactions connecting 829 drugs with 557 targets 
estimated that the average number of target proteins per drug is 
as high as 6.3 [11]. These numbers clearly indicate a high 
complexity of the protein-drug interaction space, however, most of 
the available data cover only a small subset of the "druggable" 
human proteome, which likely consists of > 3,000 drug targets 
[12]. Moreover, the interaction space is covered non-uniformly 
with a couple of hundreds of the most actively pursued targets 
covering 90% of the testing compounds [13]. 

Clearly, new approaches that can address these issues and 
effectively support modern drug discovery are needed. Over the 
past decade, we observed a growing interest in computational 
methods that could give insights into the nature of protein-drug 
interactions. Classical algorithms for the detection of relationships 
between proteins widely used in bioinformatics and structural 
bioinformatics cannot be applied to explore drug cross-reactivity 
because many compounds bind to multiple proteins that are 
completely unrelated to each other at the global sequence and 
structure levels. For example, celecoxib, an inhibitor of cycloox- 
ygenase-2, exhibit nanomolar affinity to an unrelated enzyme, 
carbonic anhydrase [14]. Telmisartan, an angiotensin II receptor 
antagonist used in the management of hypertension also acts as a 
partial agonist of the peroxisome proliferator-activated receptor-y 
that regulates fatty acid storage and glucose metabolism [15]. 
Therefore, investigating drug cross-reactivity requires a different 
set of tools. Many of these explore ligand chemistry [16], similarity 
of gene expression profiles [17] or literature-mined side effects 
[18]. A direct comparison of binding sites has the capability to 
describe ligand binding at the molecular level providing useful 
insights into the drug mode of action. On that account, it is 
considered one of the most promising computational tools in 
computer-aided drug design and the prediction of biological 
function [19]. 

Most of the algorithms for binding site matching fall into one of 
two categories: alignment-free and alignment-based methods. 
Geometric hashing is a typical example of the alignment-free 
approach; it measures the overall similarity of two binding sites, 
however, without providing structural information on the putative 
ligand binding mode and its molecular interactions with the target 
protein. For instance, PocketMatch represents binding sites as the 
sorted lists of inter-residue distances that capture their shapes and 
chemical properties [20]. The comparison of binding sites is 
performed in a frame invariant manner by aligning the distance 
lists rather than residue coordinates. A pocket similarity is then 
computed based on the overlap between two ordered sequences of 
distances. Another example is SitesBase, a binding sites database 
that allows for a rapid retrieval of similar pockets, regardless of the 
global protein sequence and fold similarities [21]. Here, the 
underlying algorithm uses geometric matching at the level of 
atomic triplets to detect common features through the identifica- 
tion of cliques and maximum common sub-graphs; the similarities 
between local environments indicate both structural and function- 



al relationships [22]. Templates used in the geometric hashing- 
based comparison of ligand binding pockets can be automatically 
derived from protein structures as demonstrated in the TESS 
program [23]. This algorithm employs a grid representation of 
functionally relevant sites, constructed based on reference frames 
defined individually for each of the 20 standard amino acid side 
chains. Surrounding atoms within a user-defined distance are first 
assigned to grid points; subsequently, the grid positions and the 
corresponding atomic labels are converted into a hash table for a 
rapid database searching. Templates automatically derived by 
TESS for the catalytic triad of ribonucleases and lysozymes have 
been used to identify several functionally interesting hits in the 
Protein Data Bank (PDB) [24]. 

In contrast to alignment-free techniques, methods based on 
binding site alignments elucidate why two sites are similar, identify 
the sets of atoms/residues that contribute to the similarity and 
describe putative ligand binding modes. However, a direct 
comparison of binding sites is more complicated and requires 
reliable sequence order-independent alignment techniques. Sev- 
eral such methods have been reported recently; for instance, 
SOIPPA performs sequence order-independent profile-profile 
alignments of binding pockets using a coarse-grained representa- 
tion of protein structures [25] . This algorithm integrates geomet- 
ric, evolutionary and physical information into a unified frame- 
work and assesses the alignment significance using the extreme 
value distribution model [26]. SuMo (Surfing the Molecules) was 
one of the first approaches to use a residue-independent 
stereochemical group description combined with a fast graph- 
based comparison heuristic to compare protein structures and 
substructures [27]. Its successor, MED-SuMo, was significantly 
improved to include functional annotation capabilities, new 
chemical features and a cavity-detection algorithm [28]. The 
effectiveness of MED-SuMo in detecting binding sites with similar 
structure-activity profiles was demonstrated using a large dataset of 
purine-binding proteins [29]. Another method, SiteEngine, 
employs low-resolution molecular surfaces constructed by con- 
verting triangles of physicochemical properties into a discrete set of 
chemically important surface points [30]. Assuming no sequence 
and fold similarity, SiteEngine offers hierarchical scoring schemes 
for global, local and global-local surface matching between 
proteins. A similar approach, ProBiS, recognizes structurally 
similar sites by analyzing patterns of physicochemical properties 
on the protein surface [31]. Using a fast maximum clique 
algorithm, this method also performs the comparison of complete 
protein surfaces. A clique-detection algorithm is also implemented 
in Cavbase [32] to compare cavities identified by Ligsite [33] using 
the degree of overlap between their exposed physicochemical 
properties. Cavbase employs 3D descriptors in the form of 
pseudocenters representing points important for molecular recog- 
nition, e.g. hydrogen bonds, hydrophobic and hydrophilic 
contacts. The application of Cavbase to the human kinome 
created a "binding pocket space", which was shown to be highly 
effective in rationalizing the cross-reactivity between unrelated 
kinases [34] . In contrast to a sequence-based classification, which 
is often unable to detect cross-relations between individual kinases, 
approaches such as Cavbase provide useful insights to support the 
development of more selective drugs. 

Ligand binding sites can be represented by "clouds" of atoms 
having certain properties, e.g. types, partial charges, etc., as 
implemented in the sup-CK algorithm [35]. Sup-CK assesses the 
similarity between two pockets using a convolution kernel upon 
the optimal alignment of their atomic "clouds". A recently 
developed method, TIPSA, employs the iterative closest point 
algorithm to superpose and compare binding pockets using the 



PLOS Computational Biology | www.ploscompbiol.org 



2 



September 2014 | Volume 10 | Issue 9 | e1 003829 



Alignment of Ligand Binding Sites in Protein Models 



atom-level representation of protein surfaces [36]. The maximum 
number of superposable atoms between two binding sites is 
identified based on the initial local alignments derived from 3D 
Delaunay triangulations. To increase the prediction accuracy, 
TIPSA incorporates additional global geometric information, the 
radius of gyration of binding site atoms, and an effective nearest 
neighbor classification scheme. Another example of a method that 
employs sequence order-independent alignments of binding 
surfaces is Solar (Signature Of Local Active Regions) [37]. This 
approach introduces a concept of signature binding sites and 
signature basis sets designed to capture information about the 
conserved and variable atomic positions at multi-resolution levels. 
Interesting features of Solar include hierarchically organized 
degrees of partial structural similarity, and an effective procedure 
for the identification of residues and atoms that are important for 
binding affinity and specificity, as demonstrated for metalloendo- 
peptidase enzymes. Despite the encouraging progress in the 
development of sequence order-independent algorithms for ligand 
binding site alignment, many of these approaches require high- 
quality binding sites extracted from either experimental protein 
structures complexed with ligands or close homology models 
constructed using holo-templates in order to achieve a high 
accuracy. 

To mitigate this issue, we developed eMatchSite, a new 
algorithm that performs sequence order-independent local binding 
site alignments using computer-generated protein models. In 
addition to its high tolerance to distortions in the target structures, 
eMatchSite also aligns predicted ligand binding sites that may 
contain inaccuracies in the definition of binding residues. A key 
feature responsible for its high performance is the extensive use of 
evolutionary information that can be extracted from only weakly 
homologous templates complexed with ligands. Essentially, the 
current work extends ideas already explored in binding pocket 
prediction by algorithms such as FINDSITE [38] and its 
successor, «FindSite [39], to address the problem of aligning and 
quantifying the similarities between ligand binding sites in 
proteins. The performance of «MatchSite is evaluated using 
several datasets and compared to other algorithms for binding site 
matching in large-scale benchmarking calculations. The results 
demonstrate that eMatchSite maintains its high prediction 
accuracy against protein models, which should prove useful in 
systems-level applications, such as polypharmacology and rational 
drug repositioning. 

Design and Implementation 

«MatchSite is a sequence-order independent algorithm for 
ligand binding site alignment and comparison. It employs a set of 
residue-level scores extracted from weakly homologous template 
proteins complexed with small molecules that cover various 
properties of binding ligands and residues. Evolutionary informa- 
tion is included as sequence profiles and entropy, as well as 
secondary structure profiles. Hydrophobicity parameters for 
amino acids, the spatial distribution of residues and ligand binding 
probabilities capture physicochemical and structural characteris- 
tics of protein residues and their interactions with small molecules. 
An important component is the chemical matching of template- 
bound ligands that effectively explores the conservation of binding 
site chemistry and ligand binding geometry across sets of 
functionally related proteins. Individual scores are combined 
using non-linear machine learning models and the alignments of 
binding sites are constructed by the Kuhn-Munkres algorithm 
[40,41] (also known as the Hungarian method) for solving 
assignment problems. 



Validation of the fold-independent matching of ligand binding 
sites requires specific datasets of proteins that bind chemically 
similar ligands despite having different sequences and structures. 
In this study, we use four datasets, the SOIPPA dataset of adenine- 
binding proteins [25], the Kahraman and Homogeneous datasets 
comprising a variety of small molecules [35,42], and the Steroid 
dataset of pharmacologically relevant steroid-binding proteins. In 
addition to the crystal structures of target proteins, we constructed 
high- and moderate-quality models to assess the performance of 
binding site matching using computer-generated structures. 
Moreover, we focus on predicted binding sites that may contain 
some inaccuracies in binding residue definition rather than 
experimental pockets. 

The performance of eMatchSite is compared to several other 
predictors, SOIPPA [25], PocketMatch [20], SiteEngine [30] and 
sup-CK [35] . These approaches represent a variety of computa- 
tional techniques developed to compare ligand binding sites in 
proteins, including geometric hashing, surface-based methods and 
sequence order-independent profile-profile alignments. Local 
predictors are also compared to two naive approaches that employ 
global sequence and structure alignments of target proteins. Using 
global similarity helps detect any possible bias that may be present 
in a particular dataset, i.e. pairs of proteins that bind similar 
ligands may also be related at the global sequence and/or 
structure level making them relatively easy targets. In the 
subsequent sections, we provide a detailed description of the 
datasets, eMatchSite implementation, evaluation metrics, and 
validation protocols used in this study. 

Datasets 

The primary dataset used in this study to train and cross- 
validate machine learning models implemented in «MatchSite 
comprises adenine-binding proteins as well as control proteins that 
do not bind ligands containing the adenine moiety. This dataset 
was compiled previously to benchmark the performance of 
another binding site alignment algorithm, SOIPPA [25]. Accord- 
ing to the SCOP classification [43], target proteins represent 167 
superfamilies and 146 folds. Ligands included in this dataset are 
adenosine-5' -diphosphate (ADP), adenosine-5' -triphosphate 
(ATP), flavin-adenine dinucleotide (FAD), nicotinamide-adenine- 
dinucleotide (NAD), S-adenosyl-L-homocysteine (SAH), and S- 
adenosylmethionine (SAM). Control ligands in the SOIPPA 
dataset form 48 chemically representative clusters at a Tanimoto 
coefficient [44] threshold of 0.7. 

In addition, we assess the performance of binding site matching 
using two other datasets. The Kahraman dataset was previously 
developed to analyze the shapes of protein binding pockets with 
respect to the shapes of their ligands [42] . This dataset comprises 
proteins bound to adenosine monophosphate (AMP), 3-(5-hy- 
droxy-5-androsten-17-one (AND) adenosine-5 '-triphosphate 
(ATP), estradiol (EST), flavin-adenine dinucleotide (FAD), flavin 
mononucleotide (FMN), a-D-glucose (GLC), protoporphyrin IX 
containing Fe (HEM), and nicotinamide-adenine-dinucleotide 
(NAD). The Homogeneous dataset was compiled to benchmark 
the performance of sup-CK, a method to quantify the similarity 
between binding pockets [35]. It consists of proteins complexed 
with the following ligands: pentaethylene glycol (1PE), B- 
octylglucoside (BOG), glutathione (GSH), lauryl dimethylamine- 
N-oxide (LDA), palmitic acid (PLM), 4'-deoxy-4'-aminopyridoxal- 
5'-phosphate (PMP), S-adenosylmethionine (SAM), sucrose 
(SUC), and uridine-5' -monophosphate (U5P). Although some 
ligands, e.g. 1PE and BOG, may bind non-specifically to proteins 
and are used to facilitate the crystallization process, we keep them 
in the dataset to make the results comparable to those reported in 
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the original publication [35]. When assessing the performance 
using the Kahraman and Homogeneous datasets, positives are 
defined as pairs of proteins that bind exactly the same ligand, 
whereas those proteins that bind different ligands are considered 
negatives. 

The last dataset contains 8 pharmacologically relevant steroid- 
binding proteins complexed with 1 7 P-estradiol (EST), estradiol- 
1 7P-hemisuccinate (HE7), and equilenin (EQU). As the control 
dataset, we use 1 ,854 proteins that bind small molecules whose size 
is comparable to that of steroids (15-25 heavy atoms), however, 
these ligands have different chemical structures with a Tanimoto 
coefficient [44] vs. EST of £0.1. Control ligands in the Steroid 
dataset form 334 chemically representative clusters at a Tanimoto 
coefficient threshold of 0.7. According to the SCOP classification 
[43], target proteins represent 185 superfamilies and 150 folds. 

Target structures 

In addition to the crystal structures of target proteins, we 
constructed weakly homologous protein models for the SOIPPA, 
Kahraman, Homogeneous and Steroid datasets. The models were 
assembled using template-based modeling by eThread [45,46], 
excluding those templates whose sequence similarity to the target 
is >40%. First, we built up to 20 models for each target, 10 using 
e Thread/Modeller and 10 using eThread/TASSER-Lite. Then, 
one model with a TM-score to native of >0.7 was randomly 
selected and included in the high-quality dataset. Similarly, a 
randomly selected model with a TM-score of 0.4-0.7 was included 
in the moderate-quality dataset. Other than crystal structures and 
weakly homologous models, the SOIPPA dataset also comprises 
artificially distorted structures, whose Coc-RMSD is within a 
narrow range (RMSD stands for a root-mean-square deviation). 
These structures were constructed by distorting the native 
conformation using an in-house software that employs conforma- 
tional Monte Carlo sampling to reach the desired RMSD from 
native while preserving the secondary structure content [47]. 
Specifically, for each target in the SOIPPA dataset, we built three 
non-native structures with a Coc-RMSD of 2 A, 4 A and 6 A. 

Ligand binding site prediction 

Ligand binding sites were identified in target proteins using 
eFindSite, a recently developed template-based approach [39,48]. 
Similar to structure modeling, binding pocket prediction was 
performed using only weakly homologous templates with a 
sequence identity to the target of ^40%. In pocket matching 
calculations, we used only those proteins, for which the center of 
each of the best of top five binding sites is predicted within a 
distance of 8 A from the experimental pocket center, with the 
corresponding Matthew's correlation coefficient calculated over 
binding residues of ^0.4. The accuracy of ligand binding site 
prediction certainly depends on the quality of target structures 
[39], therefore, as shown in Tables 1 and SI, the structural subsets 
of the SOIPPA, Kahraman, Homogeneous and Steroid datasets 
(crystal structures, high- and moderate-quality models as well as 
distorted conformations) comprise different numbers of proteins. 

Implementation of eMatchSite 

A unique feature of eMatchSite is its capability to estimate 
pairwise Ca-Ca distances between binding residues upon the 
alignment of two pockets using machine learning and a set of seven 
residue-level scores. These features cover various physicochemical 
and geometric characteristics and, importantly, can be extracted 
from only weakly homologous template structures identified by 
eFindSite. Residue-level scores implemented in «MatchSite 
employ sequence and secondary structure profiles, hydrophobicity 



parameters for amino acids, ligand binding probabilities, the 
spatial distribution of neighboring residues, sequence entropy, and 
the chemical matching of template-bound ligands. 

Sequence profile score. For each target protein, a sequence 
profile is constructed using PSI-BLAST [49] and a non-redundant 
sequence database (nr) from NCBI [50]. The nr database was 
filtered to remove low-complexity regions, transmembrane and 
coiled-coil segments [51]. Given a pair of residues i and j, the 
sequence profile score, SfJ 1 , is the dot product of their profile 
vectors: 

S7=£U?x^) (1) 
k=\ 

where seqf is the value for the amino acid k in the i-th position of 
the sequence profile of the first protein, and seqj is the value for 
the amino acid k in the j-th position of the sequence profile of the 
second protein. 

Secondary structure score. PSIPRED [52] is used to assign 
three probability values to each residue corresponding to an ot- 
helix, a P-structure and a loop conformation. The secondary 
structure score for a pair of residues i andj, Sfj, is the Euclidean 
distance between their secondary structure probability vectors. 

S U =\l{pf-Pff+ {Pf-Pff+ {Pf-Pff ( 2 ) 

where pf, pf and pf are, respectively, the probability for a-helix 
(Helix), P-structure (Extended) and loop (Coil) assigned by 
PSIPRED to the 2-fh residue in the first protein, p? , p? and pf 
are the equivalent values for the j -th residue in the second protein. 

Hydrophobicity score. Each residue type is assigned a 
vector of 20 hydrophobicity parameters according to the 
following experimental and theoretical hydrophobicity scales 
for amino acids: Abraham and Leo [53], Black and Mould [54], 
Brylinski el al. [55], Bull and Breese [56], Cowan and Whittaker 
[57], Eisenberg et al. [58], Fauchere and Pliska [59], Guy [60], 
Hopp and Woods [61], Janin [62], Kyte and Doolittle [63], 
Manavalan et al. [64], Miyazawa and Jernigan [65], Parket et 
al. [66], Rao and Argos [67], Roseman [68], Tanford [69], 
Welling et al. [70], Wilson et al. [71], and Wolfenden et al. [72]. 
The hydrophobicity score, Sjj h , corresponds to the Pearson 
correlation coefficient calculated between two hydrophobicity 
vectors for residues i and j : 



»(E^)-C 


>»(») 


JH»)-(i>) 2 ] 





where n is the number of hydrophobicity scales (20), /i, and hj 
are hydrophobicity parameters for residues i (first protein) andj 
(second protein), respectively. The summations of hydropho- 
bicity parameters (h, and hj), squared (hj and hj) and paired 
(hihj) values are taken over 20 hydrophobicity scales. 

Binding probability score. eFindSite assigns a ligand 
binding probability to each predicted binding residue in the 
protein target [39]. The binding probability score, Sfj 1 , is a 
squared difference between the binding probabilities assigned to a 
pair of residues i andj: 
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Table 1. Global and local structure quality of adenine-binding proteins 


from the SOIPPA dataset. 






Dataset 


Number of targets 


Global structure 


Ligand binding pocket 








Ca-RMSD [A] TM-score 


RMSCf [A] 


Distance I A] 


MCC 


Crystal structures 


211 






1.7±1.4 


0.70±0.10 


High-quality models 


202 


4.4±2.4 0.83±0.07 


2.0 ±1.4 


1 .8 ± 1 .4 


0.67 ±0.10 


Moderate-quality models 


174 


13.2±4.6 0.54±0.10 


5.7±3.5 


1.9±1.3 


0.62 ±0.10 



a Heavy-atom RMSD calculated over binding residues. 

b Distance between predicted pocket center and the geometric center of bound ligand. 
c Matthew's correlation coefficient for predicted binding residues. 

High- and moderate-quality models are constructed by eThread. Ligand binding sites and residues are detected by eFindSite. 
doi:10.1371/joumal.pcbi.1003829.t001 



Pi 



(4) 



pf™ and Pj is a ligand binding probability assigned by 
eFindSite to z-th residue in the first protein andj-th residue in the 
second protein, respectively. 

Neighbor distribution score. For each binding residue, we 
first calculate the distribution of Cot distances to all other residues 
in the same pocket: 



d, = (d\,d 2 , 



(5) 



where d, is a vector of distances between 2-th residue and the 
remaining binding residues in the first protein, enumerated from 
d } to d N .i; dj is the equivalent vector for the j-th residue in the 
second protein. 

Then, given a pair of residues i and j belonging to different 
pockets, we compare their neighbor distance distributions, d, and 
dj, using the non-parametric Fisher-Pitman permutation test for 
independent samples [73]. The T-value returned by this test is 
used as the neighbor distribution score, S"j b . 

Sequence entropy score. From sequence profiles generated 
by PSI-BLAST, the amino acid variability at a given residue 
position is quantified using the Shannon entropy, which provides a 
simple measure of uncertainty in a data set [74] . The sequence 
entropy score, Sf" , is a squared difference between individual 
entropies calculated for a pair of residues i and j: 



c*eni _ 



20 
k=\ 



log 2 seqf 



serf log 2 seq) 

k=l 



(6) 



where seqf and seqj have the same meaning as in Eq. 1 . 

Template ligand score. eFindSite predicts binding sites 
using evolutionarily related holo-templates. Template structures 
are superposed onto a target protein and template-bound ligands 
are transferred to the target upon the global structure alignment. 
eMatchSite uses these ligands to position two target proteins 
relative to each other. Specifically, atomic equivalences are 
established between two template ligands (one from each target 
protein) using kcombu, a heuristic build-up algorithm for 
determining one-to-one atom correspondences between chemical 
compounds [75]. Next, the two target proteins are oriented in 
space according to the superposition of template ligands and 
pairwise Cot-Cot distances between binding residues in the targets 
are calculated. We repeat this procedure m xn times, where m and 



n are the number of template ligands collected by eFindSite for the 
first and the second target protein, respectively. Given a pair of 
binding residues i and j from both targets, the template ligand 
score, Sfj, is a weighted average distance between their Cot atoms 
calculated for all template ligand combinations: 



n m 

<-,% _ k=l /=! 

ij n m 

k=l 1=1 



(7) 



where a weight w^i corresponds to the squared Tanimoto 
coefficient [44] between template ligands k and / reported by 
kcombu. Thus, the contribution from highly similar ligand pairs is 
larger than from those chemically less similar. ajy is a distance 
between the Cot atoms of residues i in the first protein andj in the 
second protein when their structures are oriented according to the 
alignment of template ligands h and /. 

Note that the set of seven residue-level scores, Sfj, Sff, Sn , 
Sfj 1 , S"j b , SjV and Sfj, are calculated for putative binding sites 
identified by eFindSite without using any information on the 
actual target-bound ligands. Therefore, this procedure can be 
applied to experimental structures in their apo conformations as 
well as to computer-generated protein models. Next, we 
constructed a machine learning model to estimate distances 
between the Ca atoms of residues belonging to the two target 
pockets upon their optimal local alignment. Reference distances 
are calculated upon the superposition of protein structures using 
the coordinates of bound ligands. SVR is used to predict these 
distances using the set of seven residue-level scores; here, we use 
the SVR implementation from libSVM 3.14 [76]. Machine 
learning model is cross-validated against the SOIPPA dataset. We 
use a non-exhaustive 6-fold cross-validation, where a subset of 
dataset proteins binding to a particular ligand are excluded, the 
model is trained on the remaining cases and Cot-Cot distances are 
predicted for the excluded group. This procedure is applied to all 6 
ligands in the SOIPPA dataset. In addition to the SVR model, we 
also evaluated an equivalent procedure employing SVC using the 
same set of seven residue-level scores. Here, rather than estimating 
Cot-Cot distances, the model predicts whether a pair of binding 
residues align to each other upon the optimal local superposition 
of two binding sites. We found that the algorithm based on the 
SVR model performs slightly better than that using SVC, 
therefore the latter was not pursued further. 

Using the machine learning-based procedure described above, 
we calculate an all-against-all matrix containing the estimated Cot- 
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Cot distances between residues belonging to two target pockets. 
The optimal alignment is found by applying the Kuhn-Munkres 
algorithm [40,41] to identify a unique set of residue pairs that give 
the shortest overall distance between their Ca atoms. This 
technique, also known as the Hungarian method, solves combi- 
natorial assignment problems in polynomial time. The sum of Cot- 
Cot distances for the solution is guaranteed to be the smallest 
amongst all possible alignment combinations. Moreover, this 
algorithm produces fully sequence order-independent alignments, 
whose length is equal to the number of binding residues in the 
smaller pocket. 

Pocket similarity score 

Finally, optimal alignments of pairs of ligand binding pockets 
are assigned a similarity score corresponding to the probability 
that these sites bind similar ligands. The similarity score is 
calculated using machine learning and an input vector of the 
following features: a Coc-RMSD calculated over equivalent 
binding residues, average residue-level scores, a chemical corre- 
lation, the physicochemical properties of putative binding ligands, 
and geometric hashing. 

Pocket RMSD. The geometric fit between two pockets, 
pRMS^ correS p 0nc i s ( 0 tne minimum Coc-RMSD calculated for 
residue equivalences from the optimal alignment. 

Average residue-level scores. In addition to the actual 
RMSD between two pockets, we include the predicted SVR and 

S VR S VC 

SVC scores averaged over aligned residue pairs, F and F : 



1 M 

(<V) 



7 svc _ 1 v svc 
(ij) 



(8) 



(9) 



where the sum is taken over M aligned residue pairs (ij) between 
the two proteins. is a score reported by SVR that 

corresponds to the expected distance between Ca atoms of 
equivalent binding residues (ij) and is a score reported by 

SVC that gives the probability that residues (ij) align to each 
other. 

Chemical correlation. eFindSite employs molecular finger- 
prints constructed for ligands extracted from evolutionarily related 
templates to conduct ligand-based virtual screening against 
predicted binding pockets [48]. «MatchSite uses this capability 
to perform virtual screening against the two predicted pockets and 
calculates the Kendall x rank correlation coefficient, F TAU : 



7 tau_ n c -n D 



(10) 



n(n— 1) 



where tic and no are the numbers of concordant and discordant 
pairs, respectively; the denominator is the total number of pair 
combinations. Any pair of library compounds is concordant if their 
ranks in the ordered lists for the two pockets agree, i.e. one 
compound is consistently ranked higher than the other. Pairs of 
compounds whose relative ranks are swapped in the two ordered 
lists are considered discordant. To perform virtual screening, we 
compiled a small library of 23,659 molecules selected from the 
ZINC collection of organic compounds by removing the 



redundancy at a Tanimoto coefficient [44] threshold of 0.5 using 
the SUBSET program [77]. The chemical correlation was 
formulated previously to construct a cross-reactivity virtual profile 
for the human kinome [78] . 

Physicochemical properties. Each ligand binding site 
identified by «FindSite is also assigned a set of consensus 
physicochemical properties of putative binding ligands, including 
the molecular weight (MW), the octanol/water partitioning 
coefficient (logp), the polar surface area (PSA), and the number 
of hydrogen bond donors and acceptors (HBD and HBA, 
respectively) [39] . As a physicochemical feature, F FCF , we average 
the differences between two binding pockets with respect to these 
properties: 



7 pcf . 



\ P\st~ P 2nd\ 



(11) 



where the sum is taken over the five abovementioned physico- 
chemical properties, and P\ st and P<ind are the binding pockets in 
the first and second protein, respectively. 

Geometric hashing. The last feature is an alignment-free 
matching score calculated using geometric hashing. Here, we 
implemented in «MatchSite a scoring scheme from PocketMatch, 
which represents each binding site as a sorted list of 90 distances 
between Ca, Cp" atoms, and the side chain geometric centers for 
amino acid residues arranged into 5 groups: group-0: A, V, I, L, 
M, G, P; group- 1: K, R, H; group-2: D, E, Q, N; group-3: Y, F, 
W; and group-4: C, S, T [20]. The pairs of distance-sets are 
aligned using a greedy strategy and the similarity score is 
calculated as the average fraction of matching elements across 
the sorted lists of distances. This feature in eMatchSite is denoted 
by F PMS , where PMS stands for the original PocketMatch score 
[20]. 

The pocket similarity score is computed by combining the six 
features described above using machine learning. The training and 
validation of the machine learning model used to assess similarities 
between pairs of pockets is carried out using adenine-binding 
proteins from the SOIPPA dataset. We follow a similar 6-fold 
cross-validation protocol as described above for assessing the inter- 
residue distance prediction. Machine learning for the estimation of 
pocket similarity is implemented using the Support Vector 
Machines algorithm for classification problems provided by 
libSVM 3.14 [76]. 

Evaluation metrics 

The quality of local binding site alignments is assessed against 
reference alignments using Matthew's correlation coefficient 
(MCC): 



MCC = 



TPx TN-FPxFN 



y/(TP + FP)(TP + FN)(TN + FP)(TN + FN) 



(12) 



where TP, FN and FP are the number of correctly aligned residue 
positions, under- and overpredicted, respectively. TN is the 
number of residue pairs correctly predicted not to align to each 
other. Reference alignments are constructed by superposing a pair 
of protein structures using the coordinates of bound ligands. We 
note that similar pockets in the Kahraman and Homogeneous 
datasets are defined as those that bind the same ligand, whereas in 
the SOIPPA and Steroid datasets, similar pockets bind ligands 
containing the adenine and estradiol moieties, respectively. Here, 
the superposition is performed using the maximum common 
substructures between two ligands identified by the Small 
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Molecule Subgraph Detector (SMSD) [79]. Upon the superposi- 
tion, the reference alignment is calculated by applying the 
Hungarian algorithm [40,41] to a matrix of all-against-all 
distances between binding residue Coc atoms (a similar procedure 
is described in [36]). Subsequently, an optimal structure alignment 
of two binding sites is constructed, where the alignment length is 
equal to the number of residues in the smaller pocket. This 
algorithm guarantees that the sum of Ca-Ca distances calculated 
over aligned residue positions is the smallest amongst all possible 
alignments with the same length. 

The alignment quality is further assessed by a ligand heavy- 
atom RMSD with an underlying assumption that the correct 
alignment of binding residues would prompt two ligands to adopt 
a similar orientation. Specifically, we superpose two proteins using 
residue Ca atoms based on a given local binding site alignment, 
which is followed by calculating an RMSD for bound ligands. The 
SOIPPA, Kahraman and Homogeneous datasets contain flexible 
compounds with multiple rotatable bonds that may have different 
internal geometries when bound to different proteins. Therefore, 
we use a method for correcting the RMSD by subtracting a heavy- 
atom RMSD calculated upon the superposition of two ligands 
alone; this corrected metric is denoted by ARMSD. 

In addition to the quality of local binding site alignments, we 
assess the capabilities of different algorithms to detect those 
pockets binding similar ligands. The SOIPPA dataset comprises 
two groups of structures, adenine-binding proteins and control 
proteins that do not bind ligands containing the adenine moiety. 
Here, positives are defined as pairs of adenine-binding proteins, 
whereas pairs of an adenine-binding protein and a control protein 
are considered negatives. An analogous definition of positives and 
negatives is used for the steroid-binding and control proteins in the 
Steroid dataset. For the Kahraman and Homogeneous datasets, 
positives and negatives are pairs of proteins that bind the same and 
different ligands, respectively. The ability to detect similar binding 
sites in different proteins is assessed by a receiver operating 
characteristics (ROC) and the corresponding area under the ROC 




0.6 0.4 
MCC 



0.0 



Protein models 

High-quality 
Moderate-quality 



Distorted structures 

2k RMSD 

4A RMSD 

6A RMSD 



Figure 1. Effects of target structure distortions on the quality 
of local alignments of ATP-binding sites. MCC is Matthew's 
correlation coefficient calculated against the reference alignments 
constructed using target crystal structures. 
doi:10.1371/journal.pcbi.1003829.g001 



curve (AUC). In this analysis, a true positive rate (TPR, also called 
sensitivity) and a false positive rate (FPR, also called fall-out) are 
defined as: 



TPR- 



F PR- 



TP 



TP + FN 



FP 



FP+TN 



(13) 



(14) 



where TP, TN, FP and FN are the numbers of true positives, true 
negatives, false positives and false negatives, respectively. 

Other predictors 

The accuracy of eMatchSite is compared to that of several other 
methods. The first two represent global sequence and structure 
alignment approaches. Sequence alignments between two proteins 
are calculated by Needleman- W unsch dynamic programming [80] 
with a sequence identity used as the alignment score. Global 
structure alignments are performed by Fr-TM-align [81], where 
the alignment significance is evaluated by a TM-score [82]. In 
addition to these global similarity measures, we analyze the 
performance of *MatchSite with respect to various local binding 
site matching algorithms. PocketMatch represents an alignment- 
free, geometric hashing approach that implements a PMScore to 
measure the similarity between ligand binding sites [20]; the stand- 
alone version of PocketMatch 2.0 is used in this study. SiteEngine 
is a surface-based algorithm developed to recognize similar 
functional sites shared by proteins that have different sequences 
and folds [30]. It measures the similarity in terms of the overlap 
between the physicochemical and geometric properties of binding 
pockets. The stand-alone version of SiteEngine 1.0 was used in a 
binding site comparison mode. Sup-CK is a method that 
represents ligand binding pockets by clouds of atoms and assesses 
the pocket similarity using a convolution kernel upon the optimal 
superposition of their atomic clouds in space [35]. For each 
program, PocketMatch, SiteEngine and sup-CK, the calculations 
are conducted using the default set of parameters. Finally, 
SOIPPA is a protein functional site comparison algorithm that 
features sequence order-independent profile-profile alignments, 
which are calculated for a reduced representation of protein 
structures [25]. The comparison of ^MatchSite to SOIPPA is 
performed only for the crystal structures of target proteins, using 
supplementary data reported in the original publication of 
SOIPPA. 

Results 

Characteristics of target structures 

eMatchSite was devised specifically for applications involving 
protein models, therefore we first discuss the structural character- 
istics of dataset proteins used in this study. In addition to crystal 
structures, we perform local binding site alignment benchmarks 
using weakly homologous protein models and artificially distorted 
structures. The former are constructed using eThread, a template- 
based approach to protein structure modeling [45,46]. Table 1 
shows the structure quality of protein models generated for the 
SOIPPA dataset. High- and moderate-quality models have an 
average TM-score to native of 0.83 and 0.54, respectively; this 
corresponds to the global Ca-RMSD (local binding pocket all- 
atom RMSD) of 4.4 A (2.0 A) for high- and 13.2 A (5.7 A) for 
moderate-quality models. Structures with a comparable quality 
were constructed for the Kahraman, Homogeneous and Steroid 
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Figure 2. Prediction of aligned residue pairs using machine learning for SAH-binding proteins from the SOIPPA dataset. The 

correlation between the actual pairwise Ca-Ca distances upon the reference alignment of binding sites and those predicted by SVR is shown for (A) 
crystal structures, (B) high-, and (C) moderate-quality protein models, respectively. (D) The ROC plot for the prediction of equivalent residue pairs 
using SVC; CS - crystal structures, HQ - high-quality, MQ - moderate-quality models, R - random prediction. 
doi:10.1371/journal.pcbi.1003829.g002 



datasets; see Supporting Information, Table SI. Furthermore, to 
generate more uniform sets of non-native models, we distorted 
crystal structures to the desired RMSD with a small standard 
deviation. Table SI shows that models deformed to 2 A, 4 A and 
6 A Cot- RMSD from native have an average TM-score of 0.91, 
0.78 and 0.68, respectively; their binding sites are distorted to 
1.3 A, 2.4 A and 3.2 A all-atom RMSD. 

In addition to the target structure, binding site matching also 
requires a pre-defined set of binding residues, which can be identified 
in experimental target structures complexed with small molecules. 
However, this information is unavailable for apo conformations and 
protein models. Therefore, an algorithm for binding site alignment 
should tolerate to some extent inaccuracies in the binding residue 
definition in order to incorporate predicted binding pockets. In that 
regard, we focus on binding sites predicted using recently developed 
fFindSite [39] rather than those obtained experimentally. Table 1 
shows that the average distance between experimental and predicted 
pockets for the SOIPPA dataset is 1.7 A, 1.8 A and 1.9 A for crystal 
structures, high- and moderate-quality models, respectively; the 
corresponding average Matthew's correlation coefficient (MCC) 
calculated for binding residues is 0.70, 0.67 and 0.62. As shown in 
Table SI, binding sites for the Kahraman, Homogeneous and 
Steroid datasets are predicted with a slightly lower accuracy; 
depending on the quality of target structures, the average distance 
is 2.0-2.2 A, 2.9-3.2 A and 2.3-2.5 A, with the corresponding MCC 
of 0.59-0.65, 0.59-0.63 and 0.61-0.67, respectively. 

We also investigate how structural imperfections in protein 
models affect the alignment of binding sites. For the SOIPPA 
dataset, we first derive reference alignments of binding pockets by 
superposing ligands bound to target crystal structures. Then, we 
repeat this procedure using binding sites predicted in protein 
models as well as distorted structures to assess the alignment 
accuracy by calculating MCC vs. the reference alignments. 
Figures 1 and SI show that even minor structural imperfections 
combined with inaccuracies in binding residue prediction signif- 
icantly alter the alignments. For instance, alignments constructed 
for 22.2%, 4.1%, 48.9%, 10.9% and 3.8% pairs of ATP-binding 
sites have MCC&0.5 when high-, moderate-quality models, 
structures distorted to 2 A, 4 A and 6 A are used (Figure 1). 
Qualitatively similar accuracy is obtained for other ligands in the 
SOIPPA dataset (Figure SI). This analysis indicates that non- 
native target structures pose significant challenges to algorithms for 
local ligand binding site alignment. 



Residue-level scores extracted from weakly homologous 
templates 

^MatchSite constructs binding site alignments from all-against- 
all pairwise Ca-Coc distances estimated by machine learning using 
a set of residue-level scores. The accuracy of inter-residue distance 
prediction is critical for the alignment quality. For the SOIPPA 
dataset, Tables 2 and S2 show the Pearson correlation coefficients 
(PCC) between the actual distances upon the superposition of 
binding ligands and those predicted by Support Vector Machines 
(for regression problems, SVR). The corresponding correlation 
plots are presented in Figures 2 and S2. For example, PCC for 
proteins binding S-adenosyl-L-homocysteine (SAH) is 0.95, 0.94 
and 0.86, when the Ca-Ca distances are predicted using crystal 
structures, high- and moderate-quality models, respectively 
(Table 2 and Figures 2A-C). In addition to SVR, we constructed 
another Support Vector Machines model (for classification 
problems, SVC), which predicts aligned pairs using the same set 
of residue-level scores. The accuracy of this classifier for SAH- 
binding proteins from the SOIPPA dataset is shown in Figure 2D; 
at a fixed false positive rate of 1%, the true positive rate is 63.6%, 
60.6% and 52.6% for crystal structures, high- and moderate- 
quality models, respectively. The performance of the SVC model 
for other proteins is shown in Figure S3. These results demonstrate 
that residue-level scores extracted from evolutionarily weakly 
homologous templates can be used to accurately predict inter- 
residue distances for local binding site alignments. Furthermore, 
the SVR model performs slightly better than the SVC classifier in 
constructing the actual alignments, therefore the former is used as 
the default method in gMatchSite. 

Binding pocket alignment by eMatchSite 

The algorithm for the sequence order-independent alignment of 
binding sites implemented in eMatchSite is illustrated in Figure 3 
for two unrelated proteins, ATP-dependent DNA ligase (PDB-ID: 
laOiA) and histamine N-methyltransferase (PDB-ID: 2aotA). Both 
proteins bind ligands that contain the adenine moiety, ATP and S- 
adenosyl-L-homocysteine, respectively. However, they share little 
similarity at the global sequence and structure levels; their pairwise 
sequence identity is 23% and the TM-score between them is 0.28. 
Using crystal structures, the distance between the experimental 
pocket center and that predicted by cFindSite (MCC calculated 
over binding residues) for laOiA and 2aotA is 2.55 A (0.81) and 
1.86 A (0.68), respectively. Figure 3A shows the matrix of all- 
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against-all Coc-Coc distances estimated by machine learning using 
SVR, where the pairs of residues selected by the Kuhn-Munkres 
algorithm [40,41] to minimize the overall distance are highlighted 
in green. These pairs are translated to the sequence order- 
independent alignment of binding residues presented in Figure 3B. 
Furthermore, Figure 3C shows the superposition of two target 
proteins according to the local alignment of their binding sites; the 
Cot-RMSD calculated over equivalent residue pairs is 2.13 A. The 
alignment accuracy can be evaluated using the relative orientation 
of binding ligands upon the superposition of target proteins as 
shown in Figure 3D. In addition to experimental structures, 
Figures 3E-H show the performance of gMatchSite using weakly 
homologous protein models, whose TM-score to the crystal 
structures is 0.46 (laOiA) and 0.57 (2aotA). For these structures 
of laOiA and 2aotA, the distance between experimental and 
predicted pocket center (MCC calculated over binding residues) is 
2.92 A (0.60) and 1.97 A (0.61), respectively. Because of structural 
distortions in the target models, both the matrix (Figures 3E) and 
the alignment (Figure 3F) slightly differ from those generated using 
crystal structures; however, eMatchSite still aligns binding residues 
with a Cot-RMSD of 2.70 A. According to this alignment, both 
binding ligands adopt a similar orientation, which is shown in 
Figure 3H. These case studies illustrate the procedure implement- 
ed in «MatchSite and demonstrate that biologically correct 
sequence order-independent alignments of ligand binding sites 
can be constructed using protein models. 

Performance on the SOIPPA dataset 

The first comparative assessment of the performance of 
eMatchSite in recognizing similar binding sites in globally 
dissimilar proteins is evaluated on the SOIPPA dataset of 
adenine-binding proteins [25]. In addition to target crystal 
structures, we perform binding site matching calculations using 
high- and moderate-quality protein models as well as distorted 
conformations. Receiver operating characteristics (ROC) are 
plotted in Figures 4 and S4 to evaluate the performance of 
binding site alignment algorithms, «MatchSite, SiteEngine and 
PocketMatch, in comparison to global similarity-based approaches 
(the corresponding AUC values are reported in Table S3). Using 
global sequence similarity yields an area under the ROC curve 
(AUC) of 0.55-0.56 across all target structures. As expected, these 
results are close to random, since the target proteins were selected 
based on the low pairwise global sequence similarity [25]. 
Structure alignments produce slightly better results with the 
AUC of 0.657, 0.655 and 0.671 for crystal structures, high- and 
low-quality models, respectively, indicating that adenine-binding 
proteins are slighdy more similar at the global structure level 
compared to control proteins. In contrast, the AUC for 
eMatchSite, SiteEngine and PocketMatch using crystal structures 
is 0.941, 0.933 and 0.603, respectively; thus eMatchSite and 
SiteEngine perform comparably well, more efficiently detecting 
similar binding sites than PocketMatch. When high- (moderate-) 
quality protein models are used, the AUC for gMatchSite, 
SiteEngine and PocketMatch is 0.953 (0.987), 0.893 (0.856) and 
0.615 (0.627), respectively. We note that the SOIPPA datasets of 
crystal structures and protein models comprise different numbers 
of proteins. This is because for some non-native target conforma- 
tions, ligand binding sites were not predicted with an acceptable 
accuracy by eFindSite due to the deformations of their global 
structures. Nevertheless, binding pocket matching algorithms can 
still be compared to each other across the same set of target 
structures. On that account, the AUC for ^MatchSite is 6% 
(13.1%) higher than that for SiteEngine using high- (moderate-) 
quality protein models. 
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Figure 3. Construction of sequence order-independent binding site alignments by eMatchSite. Two target proteins are ATP-dependent 
DNA ligase (PDB-ID: 1 aOiA, yellow) and histamine N-methyltransferase (PDB-ID: 2aotA, red). Left (A-D) and right (E-H) panels show the alignment of 
binding sites in the crystal structures and protein models, respectively. (A, E) Matrices of pairwise Coi-Cot distances between two binding sites 
predicted by SVR. Residue indexes are shown in the first column and row. Sets of residue pairs that have the smallest Ca-Coe distances identified by 
the Kuhn-Munkres algorithm are highlighted in green. (B, F) Sequence order-independent alignments of two binding sites constructed from residue 
pairs that have the smallest Ca-Ca distances; arrows indicate equivalent pairs. (C, G) Protein structures are superposed according to the local 
alignment of their binding sites; binding residues and predicted pocket centers are shown as solid sticks and balls, respectively. (D, H) Relative 
orientation of binding ligands upon the local alignment of target binding sites; ATP in 1a0iA and S-adenosyl-L-homocysteine in 2aotA are shown as 
solid and transparent sticks, respectively. 
doi:1 0.1 371 /journal.pcbi.1 003829.g003 
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Figure 4. Performance of eMatchSite, PocketMatch and SiteEngine on the SOIPPA dataset of adenine-binding proteins. The accuracy 
of local alignment predictors is compared to that using global sequence and structure alignments for (A) crystal target structures, (B) high-, and (C) 
moderate-quality protein models. TPR and FPR are the true and false positive rates, respectively; gray area corresponds to a random prediction. 
doi:1 0.1 371 /journal.pcbi.1 003829.g004 
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Table 3. Comparison of sequence order-independent binding site alignments constructed by SOIPPA and eMatchSite for adenine- 
binding proteins. 





Algorithm 


RMSD threshold 


Crystal structures 


High-quality models 


Moderate-quality models 


SOIPPA 


2 A 


6.3% 






eMatchSite 




1 5.6% 


1 1 .5% 


6.5% 


SOIPPA 


5 A 


23.6% 






eMatchSite 




60.7% 


56.4% 


52.4% 



The alignment accuracy is assessed by a ligand heavy-atom RMSD calculated upon the superposition of aligned binding residues. The percentage of benchmarking 
protein pairs for which the ligand RMSD is below 2 A and 5 A is reported. 
doi:1 0.1 371 /journal.pcbi.1 003829.T.003 



Next, we assess the accuracy of the actual alignments of ligand 
binding sites between adenine-binding proteins. The performance 
comparison for eMatchSite and SOIPPA in matching adenine- 
binding sites is shown in Table 3. Here, the accuracy is evaluated 
by an RMSD calculated over ligand heavy atoms upon the 
superposition of aligned binding residues; correct alignments are 
defined as those upon which binding ligands are positioned within 
2 A and 5 A RMSD. Using crystal structures, eMatchSite 
generates almost three times more accurate alignments than 
SOIPPA. Furthermore, eMatchSite maintains its capabilities to 
construct highly accurate alignments even when protein models of 
varying quality are used. Table 3 shows that depending on the 
model quality, the percentage of correctly aligned pairs of adenine- 
binding sites is only 4—9% lower than those aligned using crystal 
structures. This is an impressive result, given that the average Cot- 
RMSD from native calculated over ligand binding residues is 2.0— 
5.7 A (Table 1). In addition to SOIPPA, we also compare the 
performance of eMatchSite to SiteEngine across different confor- 
mations of adenine-binding proteins. Table 4 reports the average 
ligand heavy-atom RMSD calculated upon the superposition of 
aligned binding residues (Table S5 shows the alignment accuracy 
separately for different ligands). eMatchSite systematically gener- 
ates more accurate local alignments than SiteEngine, with the 
ligand RMSD better by roughly 1.0 A, 1.5 A and 2.5 A when 
crystal structures, high- and moderate-quality models are used, 
respectively. These results demonstrate that eMatchSite not only 
constructs more accurate sequence order-independent binding site 
alignments, but also offers an improved tolerance to structural 
deformations in non-native protein structures. 

Performance on the Kahraman and Homogeneous 
datasets 

In the next assessment, we use the Kahraman and Homoge- 
neous datasets compiled previously to evaluate the performance of 
binding site matching algorithms. The Kahraman dataset 



comprises proteins complexed with ligands of different sizes and 
physicochemical properties [42], whereas the Homogeneous 
dataset consists of ligands whose molecular weights are compara- 
ble [35]. Similar to the SOIPPA dataset, we use three 
conformations of the target proteins, crystal structures as well as 
high- and moderate-quality models (their characteristics are 
summarized in Table SI). Figure 5 shows the performance 
assessment for eMatchSite compared to two global similarity- 
based approaches as well as three binding site matching 
algorithms, PocketMatch, SiteEngine and sup-CK (the corre- 
sponding AUC values are reported in Table S3). Using the 
Kahraman dataset, the performance of PocketMatch is compara- 
ble to the global sequence and structure alignments and only 
marginally better than random. The accuracy of sup-CK is similar 
to SiteEngine, however, the latter performs slightly better using 
modeled target structures. What stands out is that eMatchSite 
systematically outperforms both sup-CK and SiteEngine with the 
AUC larger by 3^4% for the crystal structures and by 8-12% for 
protein models. In the original Kahraman dataset, three ligands 
that contain the adenine moiety, ATP, ADP and NAD, are 
considered as different molecules, thus recognizing a significant 
similarity between, for example ATP and ADP binding sites, 
counts as false positives. Therefore, similar to the SOIPPA dataset, 
we also assess the performance of eMatchSite for adenine-binding 
pockets grouped together, which is shown as dashed black lines in 
Figures 5A-C. Using this classification, the corresponding AUC 
for crystal structures, high- and moderate-quality models increases 
to 0.786, 0.799 and 0.792, respectively. This represents roughly a 
10% improvement with respect to the original classification, 
suggesting that eMatchSite correctly recognizes similarities 
between different ligands containing the adenine moiety. Note 
that similar relationships were detected by applying MED-SuMo 
to purine-binding proteins from the PDB [29]. The classification 
of their binding sites revealed a number of distinct clusters, many 
of which are heterogeneous, i.e. linked to various kinds of purine- 



Table 4. Comparison of sequence order-independent binding site alignments constructed by SiteEngine and eMatchSite for 
adenine-binding proteins from the SOIPPA dataset. 





Algorithm 


Crystal structures 




High-quality models 


Moderate-quality models 




RMSD 


ARMSD* 


RMSD 


A RMSD" 


RMSD 


A RMSD 3 


SiteEngine 


5.63±3.37 


3.67±2.91 


6.78±3.29 


4.83 ±2. 92 


7.89±3.68 


6.01 ±3.46 


eMatchSite 


4.81 ±2.62 


2.85 ±2.40 


5.21 ±2.55 


3.26±2.33 


5.32±2.48 


3.44±2.22 



The alignment accuracy is assessed by the average ±standard deviation ligand heavy-atom RMSD calculated upon the superposition of aligned binding residues. 
a ARMSD is calculated by subtracting from RMSD a ligand heavy-atom root-mean-square deviation upon the superposition of two ligands. 
doi:1 0.1 371 /journal.pcbi.1 003829.t004 
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ABC 




Figure 5. Performance comparison for eMatchSite, PocketMatch, SiteEngine and sup-CK. Binding site matching is conducted using the 
(A-C) Kahraman and (D-F) Homogeneous datasets. The accuracy of local alignment predictors is compared to that using global sequence and 
structure alignments for (A, D) crystal target structures, (B, E) high-, and (C, F) moderate-quality protein models. TPR and FPR are the true and false 
positive rates, respectively; gray area corresponds to a random prediction. 
doi:10.1371/journal.pcbi.1003829.g005 




Figure 6. Performance of eMatchSite, PocketMatch and SiteEngine on the Steroid dataset. The accuracy of local alignment predictors is 
compared to that using global sequence and structure alignments for (A) crystal target structures, (B) high-, and (C) moderate-quality protein 
models. TPR and FPR are the true and false positive rates, respectively; gray area corresponds to a random prediction. 
doi:1 0.1 371 /journal.pcbi.1 003829.g006 
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containing ligands. Finally, we analyze separately adenine-binding 
and other proteins from the Kahraman dataset. Figure S5 shows 
that «MatchSite gives the best performance for both subsets across 
different quality target structures. 

The global structures of proteins binding different ligands in the 
Homogeneous dataset are notably more similar to each other than 
those from the Kahraman dataset. This explains a fairly high 
accuracy of global structure alignments shown in Figures 5D-F for 
the target crystal structures, high- and moderate-quality models; 
here, the corresponding AUC values are 0.835, 0.810 and 0.808, 
respectively (Table S3). The performance of PocketMatch, 
SiteEngine and sup-CK is similar, with the latter providing a 
slightly higher accuracy; however, it is still lower compared to the 
global structure alignments. In contrast, the accuracy of 
eMatchSite is significandy higher that using global as well as local 
alignment predictors. Furthermore, the performance differences 
increase when modeled structures are used as the targets; for 
instance, the AUC for «MatchSite is 11.2% (18.8%), 15.9% 
(21.0%) and 30.1% (25.0%) larger than that for sup-CK 
(PocketMatch) using crystal structures, high- and moderate-quality 
models, respectively. 

Performance on the Steroid dataset 

The last comparison is carried out using a dataset of steroid- 
binding proteins and a large set of control proteins that bind 
chemically dissimilar ligands, whose size is comparable to that of 
estradiol. Figure 6 shows the performance of eMatchSite com- 
pared to two global similarity-based approaches as well as two 
binding site matching algorithms, PocketMatch and SiteEngine 
(the corresponding AUC values are reported in Table S3). As for 
the other datasets, we use three conformations of the target 
proteins, crystal structures, high- and moderate-quality models, 
which is shown in Figures 6A, 6B and 6C, respectively. Binding 
site matching approaches perform better than the sequence-based 
approach across all datasets of target structures. The accuracy of 
PocketMatch, SiteEngine and the structure-based approach are 
fairly comparable, except for the target crystal structures, for 
which the structure-based approach performs better than Pock- 
etMatch and SiteEngine. The AUC values for ^MatchSite are 
notably higher than those for PocketMatch and SiteEngine by 7- 
15% using target crystal structures and high-quality models, and 
by 20-26% using moderate-quality models. These results are 
qualitatively similar to those obtained for the SOIPPA, Kahraman 
and Homogeneous datasets and further demonstrate that 
eMatchSite is less sensitive to structural distortions in target 
proteins compared to other approaches. 

Availability and Future Directions 

In this study, we describe «MatchSite, a new method for 
calculating the sequence order-independent alignments of ligand 
binding sites in proteins. This approach employs a set of residue- 
level scores derived from evolutionarily related templates and 
machine learning to estimate inter-residue distances upon the 
optimal superposition of ligand-binding sites. From these distanc- 
es, local binding site alignments are constructed by the Kuhn- 
Munkres algorithm. In addition to the alignments, «MatchSite 
provides a calibrated significance score, which effectively identifies 
those pockets binding chemically similar ligands regardless of any 
global sequence and structure similarities between the target 
proteins. Benchmarking calculations are performed using four 
datasets of globally unrelated proteins that bind similar ligands. 
Compared to several other algorithms for ligand binding site 
matching, eMatchSite offers two unique features. The first is a 



high tolerance to structural deformations in ligand binding regions 
in protein models. For example, eMatchSite generates accurate 
alignments of adenine-binding pockets in crystal structures for 
almost three times more benchmarking protein pairs than 
SOIPPA. Moreover, the percentage of correctly aligned pairs of 
adenine-binding sites in weakly homologous protein models is only 
4—9% lower than those aligned using crystal structures. This 
represents a significant improvement over other algorithms, e.g. 
the performance of eMatchSite in recognizing similar binding sites 
is 6% and 13% higher than that for SiteEngine using high- and 
moderate-quality protein models, respectively. Many proteins are 
known to undergo conformational changes upon ligand binding, 
however, a high tolerance to structural distortions in protein 
models suggests that eMatchSite will work well with ligand-free 
experimental structures as well. The second feature is the 
applicability to predicted pockets that may contain inaccuracies 
in the definition of binding residues. In general, the accuracy of 
binding residue prediction depends on the quality of target 
structures [39], thus using better models results in more accurate 
local alignments of their binding sites. Moreover, using binding 
residues predicted by evolution/structure-based approaches, such 
as «FindSite [39], may yield better accuracy in pocket matching 
between members of highly conserved protein families. These 
residues correlate with the conserved aspects of molecular function 
and are independent on the size of a particular ligand that was co- 
crystallized with the target protein. In addition, if ligand binding 
occurs outside cavities in protein structures, the corresponding 
binding residues can still be correctly identified by eFindSite as 
long as these are functionally conserved across sets of evolution- 
arily related proteins. Since <?MatchSite includes strong evolution- 
ary components in its scoring function, we should expect more 
accurate results for those target proteins belonging to functionally 
conserved families with distinct ligand binding patterns. 

Constructing biologically correct alignments using predicted 
ligand binding sites in protein models opens up the possibility of 
investigating drug-protein interaction networks for complete 
proteomes. The prospective systems-level applications of 
eMatchSite include the development of safer biopharmaceuti- 
cals with reduced side effects, polypharmacology and rational 
drug repositioning. eMatchSite is freely available to the 
academic community as a web-server and a stand-alone 
software package at http://www.brylinski.org/ematchsite. This 
website also provides a complete documentation including 
walkthrough tutorials and case studies demonstrating the 
installation and execution procedures as well as the interpreta- 
tion of results. 

Supporting Information 

Figure SI Effects of target structure distortions on the quality of 
local ligand binding site alignments. MCC is Matthew's correla- 
tion coefficient calculated against the reference alignments 
constructed using target crystal structures. Alignment accuracy is 
assessed separately for different ligands from the SOIPPA dataset: 
(A) ADP, (B) ATP, (C) FAD, (D) NAD, (E) SAH, and (F) SAM. 
(TIF) 

Figure S2 Correlation between the actual pairwise Coc-Ca 
distances upon the reference alignment of binding sites and those 
predicted by SVR. The correlation is plotted separately for 
different ligands from the SOIPPA dataset, ADP, ATP, FAD, 
NAD, SAH, and SAM, using (A) target crystal structures, (B) high- 
and (C) moderate-quality models, as well as structures distorted to 
(D) 2 A, (E) 4 A and (F) 6 A Coc-RMSD. 
(TIF) 
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Figure S3 ROC plots for the prediction of equivalent residue 
pairs using SVC and different quality target structures. The 
accuracy is assessed separately for different ligands from the 
SOIPPA dataset, (A) ADP, (B) ATP, (C) FAD, (D) NAD, (E) SAH, 
and (F) SAM. TPR and FPR are the true and false positive rates, 
respectively; gray area corresponds to a random prediction. 
(TIF) 

Figure S4 Performance of «MatchSite, PocketMatch and 
SiteEngine on the SOIPPA dataset of adenine-binding proteins. 
The accuracy of local alignment predictors is compared to that 
using global sequence and structure alignments for (A) crystal 
target structures, (B) high- and (C) moderate-quality protein 
models, as well as structures distorted to (D) 2 A, (E) 4 A and (F) 
6 A Ca-RMSD. TPR and FPR are the true and false positive 
rates, respectively; gray area corresponds to a random prediction. 
(TIF) 

Figure S5 Performance of ^MatchSite, PocketMatch, SiteEn- 
gine and sup-CK on the Kahraman dataset. Binding site matching 
is conducted using (A-C) adenine-binding and (D-F) other 
proteins. The accuracy of local alignment predictors is compared 
to that using global sequence and structure alignments for (A, D) 
crystal target structures, (B, E) high-, and (C, F) moderate-quality 
protein models. TPR and FPR are the true and false positive rates, 
respectively; gray area corresponds to a random prediction. 
(TIF) 

Table SI Global and local structure quality of benchmarking 
proteins from the SOIPPA, Kahraman, Homogeneous and 
Steroid datasets. High- and moderate-quality models are con- 
structed by ^Thread. Distorted structures were generated by 
deforming the crystal structures to a desired Ca-RMSD. Ligand 
binding sites and residues are detected by «FindSite. 
(PDF) 

Table S2 Accuracy of inter-residue distance prediction for 
adenine-binding proteins from the SOIPPA dataset. The Pearson 
correlation coefficient (PCC) and the mean squared error (MSE) 
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